The previous GenerateProbs and GenerateProbsPart2 focused on creating a CDF that can be indexed with a uniform random number to determine the coupon draw. But what if we went the other way? Each coupon occupies an equal 1/N slice of the unit interval, and instead of a uniform random number we draw a random variate from some distribution (probably beta); wherever that variate lands determines which coupon gets drawn. It's way easier to set up, but there may be a slowdown because we now draw random variates instead of uniform random numbers. Let's time a few options for generating beta-distributed numbers.
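
As a minimal sketch of the binning idea (the names here are just for illustration; the sim function further down does the equivalent mapping with a cumulative comparison): each of the n coupons owns an equal 1/n slice of (0, 1), and a single beta variate is mapped to whichever slice it lands in.

from numpy.random import beta as npbeta

def draw_coupon(n, alpha, beta):
    """Map one Beta(alpha, beta) variate onto n equal-width coupon slots."""
    rv = npbeta(alpha, beta)   # variate in (0, 1)
    return int(rv * n)         # coupon index 0..n-1

# With alpha = beta = 1 the variate is uniform, so every coupon is equally likely.
print([draw_coupon(10, 1.0, 1.0) for _ in range(5)])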


In [1]:
%matplotlib inline
import numpy as np
from numpy.random import beta as npbeta
from random import betavariate as pybeta
from scipy.stats import beta as scibeta
from matplotlib import pyplot as plt
from numpy import arange, vectorize
import timeit

In [2]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = np.random.rand()
et = timeit.default_timer() - start
print(et)


0.30569263797849744

In [3]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = npbeta(1,1)
et = timeit.default_timer() - start
print(et)


0.5200948527564462

In [4]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = pybeta(1,1)
et = timeit.default_timer() - start
print(et)


2.150493072794671

In [5]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = scibeta.rvs(1,1)
et = timeit.default_timer() - start
print(et)


26.756046885535444

It looks like using random variates will make the sim run a little slower, but not by a crazy amount if we use numpy's generator: here it costs roughly 1.7x a plain uniform draw. Stay away from pure Python's betavariate (~7x) and especially scipy's rvs (~90x per call), though; they'll kill the speed.
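
One mitigation worth noting (not used in the per-call timings above, since the sim draws one variate per coupon draw): numpy can also generate variates in bulk via the size argument, which moves the loop into C.

import timeit
import numpy as np

start = timeit.default_timer()
t = np.random.beta(1, 1, size=1000000)  # one vectorized call instead of a million Python-level calls
et = timeit.default_timer() - start
print(et)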

Now that we draw coupons with random variates over equal-width bins instead of building a custom CDF, we can simulate the CCP with a single function:


In [6]:
def single_run_looped(n, dist, alpha, beta):
    """
    This is a single run of the CCP.
    n = number of unique coupons
    dist = name of the variate distribution (only the beta distribution is used here)
    alpha, beta = shape parameters of the beta distribution
    """
    m = 0 #start at zero draws
    cdf = (arange(n)+1.0)/n #equal-width bin edges: coupon i owns ((i-1)/n, i/n]
    draws = [] #create our draw array
    uniques = [] #create our unique array (deque is faster but may break DB inserts - research)
    unique = 0
    while True:
        m+=1 #increment our draw counter
        rv = npbeta(alpha, beta) #randomness that decides which coupon to draw
        draw = (cdf>rv).sum()
        if draw not in draws:
            draws.append(draw)
            unique+=1
        uniques.append(unique) #store the info
        if unique==n:#we'll stop once we have drawn all our coupons
            return m #this line returns the number of draws; for testing
            #return uniques #this line returns the full unique draws list; the actual data we want to record

vectorized_single_run_looped = vectorize(single_run_looped)
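
One implementation note: the draw not in draws membership test is linear in the number of coupons seen so far. For small n it hardly matters, but a set-based variant (sketched below; same logic, just O(1) membership checks) is an easy swap if that check ever shows up in profiling.

from numpy import arange
from numpy.random import beta as npbeta

def single_run_set(n, alpha, beta):
    """Same simulation as single_run_looped, but tracks seen coupons in a set.
    Sketch only; returns the number of draws needed to collect all n coupons."""
    m = 0
    cdf = (arange(n) + 1.0) / n  # equal-width bin edges, as in single_run_looped
    seen = set()
    while True:
        m += 1
        rv = npbeta(alpha, beta)
        seen.add((cdf > rv).sum())  # which bin the variate landed in
        if len(seen) == n:
            return m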

In [7]:
start = timeit.default_timer()
#test our sim with known results on uniform probs
trials = 200000
n = 10
records = vectorized_single_run_looped([n]*trials, ['beta']*trials, [1.0]*trials, [1.0]*trials)
num_fails = np.where(records==0)
average = np.mean(records)
std_error = np.std(records)/np.sqrt(trials)
z_crit = 1.96
low_ci = average - z_crit*std_error
high_ci = average + z_crit*std_error
expectation = np.asarray([1/(k+1) for k in np.arange(n)]).sum()*n #n*H_n, the classic expected draw count for uniform coupon probabilities
et = timeit.default_timer()-start
print ("num_fails: ", len(num_fails[0]))
print("low_ci: ", low_ci, "point_est: ", average, "high_ci: ", high_ci)
print("expected value: ", expectation)
print("elapsed_time: ", et)


num_fails:  0
low_ci:  29.2477963064 point_est:  29.296875 high_ci:  29.3459536936
expected value:  29.2896825397
elapsed_time:  24.108185512642645

So the version with random variates is ~10% slower, but that is a small price to pay for a coupon-drawing system that "makes sense" and doesn't need pesky scaling and weird math to make sure the probabilities come out right. Plus, we can use the beta distribution, which is almost as nifty statistically as the normal distribution, and we can easily iterate over alpha and beta values. Now all we need to do is figure out a way to store all this data and create a table from it.
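
As a rough sketch of where this is headed (the storage layer is still undecided, so this just collects rows in memory; the grid values are only illustrative): iterate the sim over a few alpha/beta pairs and record the mean number of draws for each.

import numpy as np

trials = 1000
n = 10
results = []  # (alpha, beta, mean_draws) rows; swap for a real table/DB insert later
for a in (0.5, 1.0, 2.0):      # illustrative alpha values
    for b in (0.5, 1.0, 2.0):  # illustrative beta values
        records = vectorized_single_run_looped([n]*trials, ['beta']*trials,
                                               [a]*trials, [b]*trials)
        results.append((a, b, np.mean(records)))

for row in results:
    print(row)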